46 research outputs found

    Pattern Mining for Named Entity Recognition

    Revised selected paper of the LTC'2011 conference. International audience. Many evaluation campaigns have shown that knowledge-based and data-driven approaches remain equally competitive for Named Entity Recognition (NER). Our research team has developed CasEN, a symbolic system based on finite-state transducers, which achieved promising results during the Ester2 French-speaking evaluation campaign. Despite these encouraging results, manually extending the coverage of such a hand-crafted system is a difficult task. In this paper, we present a novel approach that uses pattern mining both for NER and to supplement our system's knowledge base. The system, mXS, exhaustively searches for hierarchical sequential patterns that aim to detect Named Entity boundaries. We assess the effectiveness of these patterns by using them in standalone mode and in combination with our existing system.

    Who are you, you who speak? Transducer cascades for information retrieval

    International audience. This paper deals with a survey corpus and presents work on retrieving information about the speaker. We used finite-state transducer cascades, and we present detailed results with an evaluation. This work is part of a French project to enhance the ESLO corpus (a sociolinguistic survey conducted in the city of Orléans). The survey was carried out in 1968, and the project aims to store the recordings in digital format, transcribe them, and enrich the transcriptions with annotations in XML format. This work was supported by a French ANR contract (ANR-06-CORP-023) and by European funding from Région Centre (FEDER). The corpus is a collection of 200 interviews with questions about life in the city of Orléans (How long have you lived in Orléans?, What led you to live in Orléans?, Do you like living in Orléans?, etc.) and questions about the speaker's occupation or family, supplemented by recordings made in professional or private contexts. The recording situations vary: interviews, discussions between friends, hidden-microphone recordings, interviews with political, academic and religious personalities, and conversations between a social worker and parents at the Psycho Medical Center of Orléans. In total, there are 300 hours of speech, estimated at 4,500,000 words. More precisely, we worked on almost 120 transcribed hours, representing 112 Transcriber XML files (32,577 Kb). We worked on 105 files (31,004 Kb) and evaluated the results on 7 files (1,573 Kb, 5.1%). The transcription files have no punctuation marks, but the first letter of proper names is capitalized and acronyms are fully capitalized. We used the CasSys system (Friburger, Maurel, 2004), which processes texts with transducer cascades (Abney, 1996). The cascades we used are hand-built: each transducer describes a local grammar for recognizing certain entities. Sometimes this recognition requires two or more transducers applied in succession, in a specific order.
More precisely, we used two cascades. The first one, for named entity recognition, was built some years ago for a newspaper corpus, and we adapted it to the oral corpus during the project. The second one aims to discover information about the speaker in three domains: origin (is he/she a native of Orléans, or where does he/she come from?), family (is he/she married, with or without children?) and occupation (what is his/her occupation? where does he/she work?). We call this information designating entities. This second cascade was built specifically for the project. CasSys applies its transducers with the Unitex software (Paumier, 2003), which requires the text to be segmented during preprocessing. For written text, this segmentation usually relies on sentence boundary detection (Friburger et al., 2000). Our corpus has no punctuation, so we chose to use the Transcriber XML tags to perform the segmentation, and also to hide the content of the tags during the named entity task, since it is sometimes ambiguous with context entities (Dister, 2007).

    ESLO: from transcription to speakers' personal information annotation

    This paper presents the preliminary work carried out to put a French oral corpus and its transcription online. This corpus is the Socio-Linguistic Survey in Orléans, conducted in 1968. First, we digitized the corpus, then transcribed it manually with the Transcriber software, adding various tags about speakers, time, noise, etc. Each document (the audio file and the XML file of the transcription) was described by a set of metadata stored in an XML format to allow easy consultation. Second, we added several levels of annotation: recognition of named entities and annotation of personal information about speakers. These two annotation tasks used the CasSys system of transducer cascades. We used and modified a first cascade to recognize named entities. Then we built a second cascade to annotate the designating entities, i.e. information about the speaker. This second cascade parsed the corpus already annotated with named entities. The objective is to locate information about the speaker and also to determine what kind of information can designate him/her. These two cascades were evaluated with precision and recall measures. Comment: LREC2010, Malta (2010)

    Enrichment of Renaissance texts with proper names

    International audience. The Renom project proposes to enrich Renaissance texts with proper names. These texts present two new challenges: great diversity due to the varied spellings of words, and numerous XML-TEI tags that preserve the exact format of the original edition. The task consisted of adding Named Entity tags to this markup, generally using the left context and sometimes the right context of a name. To do that, we improved the free and open source program CasSys, which parses texts with Unitex graph cascades, and we built dictionaries and specific cascades. The slot error rate was 6.1%. Proper Names and maps. were to allow navigating into. So, this paper deals with Named Entity Recognition in Renaissance texts.

    Automatic rich annotation of large corpus of conversational transcribed speech

    International audience. This paper describes the use of the CasSys platform to chunk conversational speech transcripts by means of cascades of Unitex transducers. Our system is part of the EPAC project of the French National Research Agency (ANR). The aim of this project is to develop robust methods for annotating audio/multimedia document collections that contain conversational speech sequences, such as TV or radio programs. First, this paper presents the EPAC project and the adaptation of a former chunking system (Romus), which was developed in the restricted framework of dedicated spoken man-machine dialogue. Then, it describes the problems arising from 1) spontaneous speech disfluencies and 2) errors from the previous processing stages (automatic speech recognition and POS tagging).

    Fouille de règles d'annotation pour la reconnaissance d'entités nommées

    National audience. As with many other NLP problems, named entity recognition involves both knowledge-based and data-driven systems. In this article, we propose a middle-ground approach by adapting methods from knowledge discovery. Our system, mXS, integrates hierarchical sequential mining techniques for detecting named entities. The system adopts a data-centred approach to extract symbolic patterns. It also relies on an original strategy that searches separately for the beginning and the end of entities. This approach has the advantage of retaining some robustness to noise and disfluencies. It is suited to the system's target application: detecting named entities in automatically transcribed conversational speech streams. In this respect, mXS took part in the ETAPE evaluation campaign, where it achieved good results. This article presents how mXS works and its performance on datasets from two French-language evaluation campaigns (ESTER 2 and ETAPE).

    Explorer des corpus à l'aide de CasSys. Application au Corpus d'Orléans

    International audience. This article presents CasSys, a corpus exploration tool that linguists can easily configure, which recognizes even complex patterns and marks them up, optionally with XML tags. This automatic markup can then be revised by an expert. CasSys is therefore a corpus exploration tool, but also a tool for semi-supervised enriched annotation. Two real examples complete this presentation: the search for named entities in the Corpus d'Orléans, and the use of these entities to find information about the people who answered the survey that makes up this corpus. This work was funded by the ANR Variling project and by a Feder Région Centre project. It was also tested in the Ester2 evaluation (an evaluation campaign for enriched transcription systems for radio broadcasts).

    Recherche d'information par cascade de graphes Unitex

    National audience.

    An Analysis of the Performances of the CasEN Named Entities Recognition System in the Ester2 Evaluation Campaign

    8 pages. In this paper, we present a detailed and critical analysis of the behaviour of the CasEN named entity recognition system during the French Ester2 evaluation campaign. In this project, CasEN was confronted with the task of detecting and categorizing named entities in manual and automatic transcriptions of radio broadcasts. First, we give a general presentation of the Ester2 campaign. Then, we describe our system, which is based on transducers. Next, we describe how systems were evaluated during this campaign and report the main official results. Afterwards, we investigate in detail the influence of some annotation biases that significantly affected the estimation of system performance. Finally, we conduct an in-depth analysis of the actual errors of the CasEN system, which gives useful indications about the phenomena that gave rise to errors (e.g. metonymy, encapsulation, detection of right boundaries) and that remain challenges for named entity recognition systems.

    Reconnaissance d'entités nommées : enrichissement d'un système à base de connaissances à partir de techniques de fouille de textes

    International audience. In this paper, we present and analyze the results obtained by our named entity recognition system, CasEN, during the Ester2 evaluation campaign. We identify the difficulties that challenged our system most, mainly out-of-vocabulary words, metonymy and the detection of named entity boundaries. Next, we propose a direction that may help improve the performance of our system, using exhaustive hierarchical and sequential data mining algorithms. This approach aims to extract patterns corresponding to linguistic constructs that are useful for recognizing named entities. Finally, we describe our experiments, report the results we currently obtain and analyze them.